Introduction

This paper studies load balancing in a multiprocessor environment. We consider a shared memory
model similar to the one described in [3]. Briefly, there are $k$ processors, each with unlimited
individual memory, and a segment of shared memory. Tasks arrive, or are generated by ongoing tasks,
at unpredictable instants, and their computation times are also unpredictable. We assume all tasks
are independent of each other and may be assigned in any order to any processor. The $k$ processors
must perform these tasks promptly and, at the same time, with minimum interference from each other.
Any load balancing algorithm must keep track of the loading conditions of the processors and assign
a suitable number of available tasks whenever a processor becomes free. This dynamic information is
maintained in the shared memory. We assume, as in [3], that all processors execute the same load
balancing algorithm and access the shared memory asynchronously. This leads to the possibility of
interference, or collisions. The goal of our algorithm is to minimise this interference while
keeping all processors busy. As in [3], we assume that the number of potential collisions is the
sole performance criterion.

Here, we present an algorithm that performs dynamic task scheduling. For $k$ processors to process
$n$ tasks, our algorithm incurs $O(k \log k \log n)$ collisions in the worst case. This is an
improvement on the algorithm in [3], which incurs $O(k^{\varepsilon} \log n)$ ($\varepsilon > 1$)
collisions in the worst case.


The  Problem 

One approach to the problem is to retain all tasks (or their descriptors) in the shared memory. Each
processor is assigned only one task, and on its completion the processor accesses the shared memory
to acquire another task. There are two disadvantages to this strategy. First, if a task generates a
new task, then the new task must be "reported" or "put" into the shared memory. Second, at the
completion of a task each processor must access the shared memory to obtain a new task. Both events
can lead to collisions. If there were $n$ tasks to be processed, there could be $2n$ (i.e., $O(n)$)
collisions. Hence when reassigning tasks to processors one must assign not one task but a certain
number of them, depending on the dynamic conditions. But if several tasks are being assigned to
processors, then idle processors must not only look in the shared memory but also examine the loads
at other processors. Here, by the load of a processor we mean the number of tasks assigned to it.
Idle processors must then "visit" overloaded processors and relieve some of their load. This
"visit", as it is interference, is also counted as a collision. Thus in our worst case analysis,
every possibility of interference is counted as a collision. From our analysis above, any
scheduling algorithm must handle two issues:

(i) Once a processor finds itself idle, it must arrange to "visit" another processor for sharing its
load. Thus there must be a policy to decide which processor to visit.

(ii) Once processor $p_i$ decides to visit processor $p_j$, efficient data structures should ensure
that load sharing is done speedily. Processor $p_i$ must also decide what fraction of processor
$p_j$’s load it should take.

We briefly sketch Manber’s algorithm here, and refer the reader to [3] for a complete discussion.
Our algorithm differs from Manber’s only in its way of handling (i) above.

Each processor $p_i$ has an individual area in the shared memory, called its local segment, in which
$p_i$ stores the tasks (or their descriptors) assigned to $p_i$. The data structure here is similar
to a binary tree which allows addition, deletion, and splitting in constant (i.e., $O(1)$) time. The
split operation splits the tree into two trees whose sizes are between 1/3 and 2/3 of the original.
Whenever a processor finishes its current task, it accesses its local segment first. If the local
segment is not empty, then the processor picks up a task from its local segment and deletes the task
(or the task’s descriptor) from its local segment. If the currently executing task generates more
tasks, or if some external tasks arrive, then the processor puts the new tasks into its local
segment. When a processor finds its local segment empty, however, it proceeds to the global memory
to find a processor that has a non-empty local segment. This local segment is split, and the visitor
takes a part of the host’s local segment into its own local segment. We say that some of the host’s
tasks have migrated to the visitor.
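
The migration step can be pictured with a minimal sketch. It assumes that a plain Python deque stands in for the local segment, so the split costs time proportional to the number of tasks moved rather than the $O(1)$ of the tree structure in [3]. The function name `migrate` and the choice to hand over the tail half of the segment are assumptions made for illustration only.

```python
from collections import deque


def migrate(host: deque, visitor: deque) -> int:
    """Move the tail of the host's segment to the visitor; return how many tasks moved.

    Assumption of this sketch: the visitor takes floor(s/2) of the host's s tasks
    (at least one), which lies between s/3 and 2s/3 whenever s >= 2.
    """
    if not host:
        return 0
    take = max(1, len(host) // 2)
    moved = [host.pop() for _ in range(take)]  # popped from the tail, so reversed
    visitor.extend(reversed(moved))            # restore the original order
    return take


if __name__ == "__main__":
    host, visitor = deque(range(9)), deque()
    print(migrate(host, visitor), list(host), list(visitor))
```

Handing over a contiguous tail keeps the relative order of the tasks on both sides, which is the only property of the split that this sketch tries to preserve.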

Processor $p_i$ is busy if it is processing a task or its local segment is non-empty; otherwise
$p_i$ is idle. A global data structure maintains the status (i.e., busy or idle) of each processor.
The data structure of [3] is organised as an $m$-ary tree with each of the $k$ processors as a leaf.
Each internal node maintains the loading conditions of its leaves. Then finding a busy processor
takes $O(k^{\varepsilon})$ collisions (see [3] for details). The number $\varepsilon$ approaches 1
as $m$ increases. Thus for $k$ processors to finish $n$ tasks, this scheme incurs
$O(k^{\varepsilon} \log n)$ collisions in the worst case. Also, Manber established a lower bound of
$\Omega(k \log n)$ collisions.

Our global data structure, which resides in the shared memory, is a Fibonacci heap [2]. As an
abstract data structure on $k$ objects $\{x_i\}$ with real weights $\{w_i\}$, a Fibonacci heap
supports the following operations:

max: find the object of maximum weight.

incr(x, r): increment the weight of $x$ by a positive real $r$.

decr(x, r): decrement the weight of $x$ by a positive real $r$.

The max and decr operations take $O(1)$ (amortized) time, and the incr operation takes $O(\log k)$
time. Hence each operation involves $O(\log k)$ memory accesses. We use a Fibonacci heap $F$ to
maintain the processor loading information. Each processor $p_i$ is an object, and its reported load
$R_i$ is the weight associated with $p_i$.
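
The interface that the algorithm needs from $F$ can be illustrated with a small stand-in. This is not a Fibonacci heap: it is a dictionary-backed table, hypothetically named `LoadTable`, whose `max` scans all $k$ entries, so it shows only the three operations and none of the time bounds quoted above.

```python
class LoadTable:
    """Hypothetical stand-in for the heap F: same operations, no Fibonacci-heap bounds."""

    def __init__(self, k: int) -> None:
        # Reported load (weight) of each of the k processors, initially zero.
        self.weight = {i: 0.0 for i in range(k)}

    def max(self) -> int:
        """Return the processor with the largest reported load (linear scan)."""
        return max(self.weight, key=self.weight.get)

    def incr(self, x: int, r: float) -> None:
        """Increase the reported load of processor x by a positive amount r."""
        if r <= 0:
            raise ValueError("r must be positive")
        self.weight[x] += r

    def decr(self, x: int, r: float) -> None:
        """Decrease the reported load of processor x by a positive amount r."""
        if r <= 0:
            raise ValueError("r must be positive")
        self.weight[x] -= r


if __name__ == "__main__":
    F = LoadTable(3)
    F.incr(0, 5)
    F.incr(2, 9)
    print(F.max())
```

Any structure offering these three operations within $O(\log k)$ accesses could be substituted without changing how the processors use it.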

For  ease  of  analysis,  we  first  present  a  simple  version  of  our  algorithm.  Once  again,  our  algorithm 
differs  from  [3]  only  in  how  it  finds  a  processor  to  visit. 

Each processor is individually responsible for updating its loading information in $F$. To minimize
updating collisions when two processors try to access $F$ at the same time, however, the processors
report only increases in their previously reported loads. Thus at any time, the reported load is an
overestimate of the actual load. When processor $p_i$ completes a task, it first checks its local
segment. If that is empty, then $p_i$ accesses $F$ and performs the operation max to obtain the name
of the processor $p_j$ whose reported load is the largest. After $p_i$ visits $p_j$, processor $p_i$
has between one-third and two-thirds of $p_j$’s old load and $p_j$ has the remainder. Processor
$p_i$ uses incr and decr to record the new loads of $p_i$ and $p_j$ in $F$. Note that though
visiting the most heavily loaded processor seems to be a good policy for minimising future visits,
it entails maintaining fairly accurate loading conditions in the global memory. This in turn
requires processors to report their loads, which is collision-prone. So there are two types of
collisions: reporting collisions and visiting collisions. We estimate them separately. Note that
when a processor completes a task and deletes the task from its local segment, the access to the
local segment is not counted as a potential collision.

Visiting Collisions

We begin by estimating the visiting collisions in our scheme. First, a "static" result. Assume that
there are $n$ tasks, and these tasks are distributed over the local segments of the $k$ processors.
Further, no new tasks are generated within, or added to, the system, so no reporting collisions
occur.

Theorem 1. In the static case, if there are $n$ tasks in a $k$-processor system, then in the worst
case the processors complete all tasks with $O(k \log k \log n)$ collisions.

Proof. First note that every operation on $F$ takes $O(\log k)$ time, hence $O(\log k)$ collisions,
in the worst case. We consider each such operation as a unit step and proceed to estimate the number
of steps.

By convention, the computation begins at step 0. Let

$R_i(t)$ = the reported load of processor $i$ at step $t$.

$L_i(t)$ = the actual load of processor $i$ at step $t$.

$m(t) = \max_j \{R_j(t)\}$.

Thus $m(t)$ is the largest weight in $F$ at step $t$. If processor $p_i$ visits processor $p_j$ at
step $t_0$, then $p_i$ takes between one-third and two-thirds of $p_j$’s load. In our notation,

\[ R_i(t_0) = L_i(t_0) \le \tfrac{2}{3} L_j(t_0 - 1) \le \tfrac{2}{3} R_j(t_0 - 1), \]

\[ R_j(t_0) = L_j(t_0) \le \tfrac{2}{3} L_j(t_0 - 1) \le \tfrac{2}{3} R_j(t_0 - 1). \]

Hence

\[ R_i(t_0) \le \tfrac{2}{3} m(t_0 - 1) \quad\text{and}\quad R_j(t_0) \le \tfrac{2}{3} m(t_0 - 1). \]

Now consider $m(t_0 + k)$, i.e., the maximum reported load after $k$ steps ($k$ potential visiting
collisions). We claim that $m(t_0 + k) \le \tfrac{2}{3} m(t_0)$. Observe that at step $t_0$ there
can be at most $k$ processors with reported load greater than $\tfrac{2}{3} m(t_0)$, and that the
processor with the largest reported load is visited at each step. Hence in the $k$ visits following
$t_0$ all the processors with load greater than $\tfrac{2}{3} m(t_0)$ must be visited, and our claim
must hold. But $m(0) \le n$, so the number of steps in the computation is bounded by
$O(k \log n)$. With $O(\log k)$ collisions per step, the theorem follows. □
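
The static case can be explored with a toy simulation. The sketch below makes several assumptions that are not part of the analysis above: processors run in synchronous rounds, each busy processor executes one task per round, a visitor takes $\lfloor L_j/2 \rfloor$ of the host’s actual load, and only visits (not individual heap operations) are counted. The function name `simulate_static` is hypothetical.

```python
def simulate_static(loads: list[int]) -> int:
    """Count visits until all tasks are done, with no new tasks arriving.

    Reported loads are updated only at visits, so between visits they
    overestimate the actual loads, as in the simple reporting rule.
    """
    k = len(loads)
    if k < 2:
        return 0
    actual = list(loads)
    reported = list(loads)
    visits = 0
    while any(actual):
        # Idle processors visit the other processor with the largest reported load.
        for i in range(k):
            if actual[i] == 0:
                j = max((c for c in range(k) if c != i), key=lambda c: reported[c])
                if reported[j] == 0:
                    continue  # nothing is advertised anywhere
                visits += 1
                take = actual[j] // 2
                actual[j] -= take
                actual[i] += take
                reported[i], reported[j] = actual[i], actual[j]
        # Every processor with work executes one task; decreases are not reported.
        for i in range(k):
            if actual[i] > 0:
                actual[i] -= 1
    return visits


if __name__ == "__main__":
    print(simulate_static([1024, 0, 0, 0]))
    print(simulate_static([1000, 3, 0, 0, 0]))
```

Because every round retires at least one task, the simulation always terminates. It is meant only as a way to observe how visit counts grow with $k$ and $n$ under these simplifications, not as evidence for the worst-case bound.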

In the static case above, there were no additional tasks or reporting collisions to complicate our
analysis. Nevertheless, the bound on the visiting collisions holds even in the dynamic case, when
visiting collisions are interspersed with reporting collisions.

Theorem 2. Under dynamic conditions, in the worst case, $k$ processors complete $n$ tasks with at
most $O(k \log k \log n)$ visiting collisions.

Proof. In the proof of Theorem 1, since no tasks were introduced into the system, $m(t)$ was a
monotonically decreasing function. This is not true in the dynamic case. Let us assume that all the
$n$ tasks in the system at step 0 are blue in colour, and all tasks generated or added later are
green. So at any step in the computation each processor has a mix of blue and green tasks. Further,
as a theoretical convenience, we assume that when $p_i$ visits $p_j$, processor $p_i$ takes equal
proportions of blue and green tasks.

In addition to the previously defined functions $R_i(t)$, $L_i(t)$, $m(t)$, let us define
$R'_i(t)$, $L'_i(t)$, $m'(t)$ considering only the blue jobs. Thus, for example, $R'_i(t)$ is the
reported blue load of processor $i$ at step $t$. Further, define $n'(t)$ as the number of blue tasks
in the system at step $t$. Again, by a step we mean a single Fibonacci heap operation (which incurs
$O(\log k)$ collisions in the worst case). So at any step $t$,

Our line of proof is as follows. We show that in $\beta k$ steps after step $t$ (for a constant
$\beta > 1$ to be chosen later), either $n'(t)$ tasks finish execution or
$m'(t + \beta k) \le \tfrac{2}{3} m'(t)$. This would imply that some $n$ jobs, blue or green, have
been processed in $O(k \log n)$ steps following step $t$. Now at time $t$, there can be at most $k$
processors with $R'_i(t) > \tfrac{2}{3} m'(t)$. If $m'(t + \beta k) > \tfrac{2}{3} m'(t)$, then
clearly there must exist a processor $p_j$ such that $R'_j(t) > \tfrac{2}{3} m'(t)$ and
$R'_j(t + \beta k) > \tfrac{2}{3} m'(t)$. Since equal proportions of blue and green tasks migrate at
every visit, we conclude that processor $p_j$ has not been visited in the $\beta k$ steps following
$t$. As our policy is to visit the most heavily loaded processor, the $\beta k$ visits must have
involved processors whose loads were larger than $R'_j(t)$. Hence each visit must have involved a
migration of at least
$\tfrac{1}{3} R'_j(t) > \tfrac{2}{9} m'(t) \ge \tfrac{2}{9} \cdot \tfrac{n'(t)}{k}$ tasks. Also note
that between successive visits by processor $p_i$, $p_i$ must execute all tasks acquired at the
previous visit. Now as there are $k$ processors in the system, and there have been $\beta k$ visits,
at least $(\beta - 1) k \bigl( \tfrac{2}{9} \cdot \tfrac{n'(t)}{k} \bigr)$ tasks must have been
processed in the meantime.

Thus for $\beta = 11/2$, in $\beta k$ steps either $m'(t)$ decreases to two-thirds of its original
value, or at least $n'(t)$ tasks are processed. Formally, either
$m'(t + \tfrac{11}{2} k) \le \tfrac{2}{3} m'(t)$ or $n'(t)$ tasks have been processed between steps
$t$ and $t + \tfrac{11}{2} k$. But $n'(t) = n$ is the number of blue tasks in the system at step
$t$. Thus in $O(k \log n)$ steps $n$ tasks were processed, though they need not all be blue. This
proves the theorem. □
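
The choice of the constant can be checked with a short calculation. It assumes that each visit moves at least $\tfrac{2}{9} \cdot \tfrac{n'(t)}{k}$ tasks (one-third of a reported blue load exceeding $\tfrac{2}{3} m'(t)$, with $m'(t) \ge n'(t)/k$) and that at least $(\beta - 1)k$ of the $\beta k$ visits are followed by the execution of everything they acquired. Under these assumptions the number of tasks processed reaches $n'(t)$ exactly when $\beta \ge 11/2$:

```latex
\[
(\beta - 1)\,k \cdot \frac{2}{9} \cdot \frac{n'(t)}{k}
  \;=\; \frac{2}{9}\,(\beta - 1)\,n'(t)
  \;\ge\; n'(t)
\quad\Longleftrightarrow\quad
\beta \;\ge\; \frac{11}{2}.
\]
```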

Reporting  Collisions 

Next, we estimate reporting collisions. First, note that our proofs of Theorem 1 and Theorem 2 were
based on the assertion that the reported load is always an overestimate of the actual load of
processors. This assertion, in turn, was based on the assumption that processors report every
increase in their previously reported load. Under these conditions we showed that it takes
$O(k \log k \log n)$ collisions, in the worst case, to process $n$ tasks. Hence if
$O(k \log k \log n)$ reporting collisions suffice to introduce $n$ tasks into the system, then our
simple algorithm would incur $O(k \log k \log n)$ total collisions. But this would imply a "chunky"
task arrival/generation process, which may not be a reasonable assumption in many situations. To
accommodate this possibility, we relax our requirement that processors report every increase in load
immediately. Instead, we ask that if a processor’s current load is $L$, then it report the number
$\lceil \log_\rho L \rceil$, where $\rho > 1$ is a constant to be determined later. If at a task
arrival/generation this number does not change, then the processor does not report an increase. Thus
our modified reporting rule is as follows: Do not report any load reductions; report increases only
if $\lceil \log_\rho L \rceil$ changes, where $L$ is the current load of the processor. Of course,
if $p_i$ visits $p_j$, then the new loading conditions of both processors are reported, but that has
already been counted as a visiting collision.
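
The modified reporting rule can be sketched as follows. The base $\rho = 7/5$ is an assumed value chosen only for illustration, and the helper names `level` and `count_reports` are hypothetical. Exact rational arithmetic is used so that the ceiling is not disturbed by floating-point rounding.

```python
from fractions import Fraction

RHO = Fraction(7, 5)  # assumed base for illustration; the text requires only rho > 1


def level(load: int, rho: Fraction = RHO) -> int:
    """Smallest j >= 0 with rho**j >= load, i.e. ceil(log_rho(load)) for load >= 1."""
    j, power = 0, Fraction(1)
    while power < load:
        power *= rho
        j += 1
    return j


def count_reports(start: int, arrivals: int, rho: Fraction = RHO) -> int:
    """Reports made while the load grows one task at a time, with no completions."""
    reports, last = 0, level(start, rho)
    for load in range(start + 1, start + arrivals + 1):
        now = level(load, rho)
        if now != last:  # the reported number changes, so the processor reports
            reports += 1
            last = now
    return reports


if __name__ == "__main__":
    print(level(1000), count_reports(1, 1000))
```

Since the reported number only moves upward through distinct integer values, the number of reports during a run of arrivals cannot exceed the final level, which is the logarithmic saving the rule is designed to obtain.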

Let us see how this new reporting policy affects our proofs of Theorems 1 and 2. The actual load in
the modified reporting scheme is at most $\rho$ times the reported load. Hence as long as
$\tfrac{2}{3} \rho < 1$, i.e., $\rho < \tfrac{3}{2}$, Theorems 1 and 2 hold.

First, a static result. Assume that there are $n$ tasks in the $k$-processor system at time $t_0$.
We analyse the worst case number of reporting collisions required to introduce an additional $n'$
tasks. Further, we assume that while we introduce these tasks, no visiting collisions occur. Let
$n_i$ be the load of processor $p_i$ at time $t_0$, and of the $n'$ new tasks, let $n'_i$ be added
to $p_i$. Then we have:

Theorem 3. In the static case, in a $k$-processor system, $O(k \log k \log n')$ reporting collisions
suffice to add $n'$ tasks into the system, in the worst case.

Proof. Clearly, in the static case, adding $n'_i$ tasks to processor $p_i$ takes no more than
$\log_\rho(n'_i + n_i) - \log_\rho(n_i) \le \log_\rho n'_i$ reporting steps. Note that while there
may be no visiting collisions, there is the possibility that $p_i$ completes some of its assigned
tasks. Under our reporting scheme, this could only lead to fewer reports, since more tasks may
arrive at $p_i$ before $p_i$ reaches the next reporting level. Hence the total number of reports is
less than

\[ \sum_{i=1}^{k} \log_\rho n'_i \le k \log_\rho n'. \]

Therefore the number of steps required is $O(k \log n')$, and the number of possible collisions is
$O(k \log k \log n')$. □

In the dynamic case, as a theoretical convenience, we assume that the tasks form a first-in
first-out (FIFO) queue in front of their processors. Thus if tasks $b_0, b_1, \ldots, b_s$ are in
processor $p_1$’s queue, then $b_0$ entered the queue the earliest and $b_s$ the latest. The next
task arriving at or generated for $p_1$ sits after $b_s$. Further, we assume that when $p_1$ has
$b_1, \ldots, b_{m'}, b_{m'+1}, \ldots, b_{m'+m}$ in its queue, and a visit by $p_2$ takes $m$ tasks
from $p_1$, then after the visit, $p_1$ has $b_1, \ldots, b_{m'}$ and $p_2$ has
$b_{m'+1}, \ldots, b_{m'+m}$ as their respective queues. Thus visits do not disturb the order of the
migrating tasks or of the tasks left behind. Let $b_r$ be the $m$th task from the head of the queue
at time $t$ in some processor. Define the level number $l(b_r, t) = j$ such that
$\rho^{j-1} < m \le \rho^{j}$.

Clearly, for each $b$, $l(b, t)$ only decreases with time. Further, since $\tfrac{2}{3} \rho < 1$,
when task $b$ migrates to a visiting processor, $l(b, t)$ decreases by at least 1 in this migration.
Let $l_0(b)$ denote the level number of task $b$ when it first entered the system. Then
$l(b, t) \le l_0(b)$ for all tasks $b$, at any instant $t$. In the following theorem, we make a
steady state assumption that is clarified in the proof. Why we need to make such an assumption is
discussed in the next section.
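
The claim that a migration lowers the level number by at least 1 can be spot-checked by brute force. The sketch assumes that the level of the task in position $m$ of a queue is $\lceil \log_\rho m \rceil$, that a visitor takes the last `taken` tasks of a queue of `size` tasks with `size/3 <= taken <= 2*size/3`, and that a migrated task in position `kept + x` lands in position `x` of the visitor’s queue. The names `level` and `min_level_drop` are hypothetical.

```python
from fractions import Fraction


def level(position: int, rho: Fraction) -> int:
    """Smallest j >= 0 with rho**j >= position (exact rational arithmetic)."""
    j, power = 0, Fraction(1)
    while power < position:
        power *= rho
        j += 1
    return j


def min_level_drop(max_size: int, rho: Fraction) -> int:
    """Smallest level decrease of any migrating task over all admissible splits."""
    drops = []
    for size in range(2, max_size + 1):
        for taken in range(1, size):
            if not (size <= 3 * taken <= 2 * size):
                continue  # split must hand over between 1/3 and 2/3 of the queue
            kept = size - taken
            for x in range(1, taken + 1):
                drops.append(level(kept + x, rho) - level(x, rho))
    return min(drops)


if __name__ == "__main__":
    print(min_level_drop(30, Fraction(7, 5)))  # a base below 3/2
    print(min_level_drop(30, Fraction(2)))     # a base above 3/2
```

With a base below $3/2$ every admissible migration drops the level by at least 1, while with base 2 some migrations leave the level unchanged (for example, position 8 of a queue of 8 moving to position 5 of the visitor), which illustrates why the condition $\tfrac{2}{3} \rho < 1$ is needed.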

Theorem 4. At time $t$, let processor $p_1$ have $n$ tasks in its queue. Then, under the above
scheme, and under the steady state assumption, no more than $O(\log k \log n)$ reporting collisions
could have occurred while introducing the $n$ tasks into the system.

Proof. Here again we consider Fibonacci heap operations as steps. Let $b_1, \ldots, b_n$ be in
$p_1$’s queue at time $t$. Let $0 = m_0, m_1, \ldots, m_r$ be numbers such that
$b_{m_j+1}, \ldots, b_{m_{j+1}}$ originated at processor $p_{o_j}$ for all $0 \le j \le r-1$, where
$m_r = n$. Call $b_{m_j+1}, \ldots, b_{m_{j+1}}$ the $j$th segment of $p_1$. Next, note that the 0th
segment has incurred at least $r-1$ migrations due to visits. Since the level number decreases by at
least 1 in every visit, we have $r \le l_0(b_1)$. Further, all segments except possibly the last one
(i.e., the $(r-1)$th) were acquired by $p_1$ in just one visit. This last segment may have recently
arrived at $p_1$. As level numbers decrease in every migration, each segment of $p_1$ must have
originated at a level higher than its current level.

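For $x$, $y$, and $z > 0$ such that $x \ge y$, we have
$\log(x+z) - \log(x) \le \log(y+z) - \log(y)$, and hence
$\lceil \log(x+z) \rceil - \lceil \log(x) \rceil \le \lceil \log(y+z) \rceil - \lceil \log(y) \rceil + 1$.
Consequently

\[ l_0(b_{m_{j+1}}) - l_0(b_{m_j+1}) \le l(b_{m_{j+1}}, t) - l(b_{m_j+1}, t) + 1. \]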

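But $l_0(b_{m_{j+1}}) - l_0(b_{m_j+1})$ is the number of possible reporting steps incurred in
introducing the $j$th segment into the system at $p_{o_j}$. So summing both sides of the above
inequality over all segments we get:

\[ \sum_{j=0}^{r-1} \bigl( l_0(b_{m_{j+1}}) - l_0(b_{m_j+1}) \bigr) \le \lceil \log_\rho n \rceil + r \le \lceil \log_\rho n \rceil + l_0(b_1). \]

At this point we make the steady state assumption and say that $l_0(b_1)$ is $O(\log n)$. Thus
$O(\log n)$ reporting steps, each incurring $O(\log k)$ collisions, occurred during the addition of
$n$ tasks to the queue of $p_1$. □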
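
The ceiling inequality used in the proof can be spot-checked on small integers. The sketch uses base-2 logarithms instead of base $\rho$ so that the ceilings can be computed exactly with integer arithmetic; the helper names `ceil_log2` and `worst_gap` are hypothetical.

```python
def ceil_log2(v: int) -> int:
    """Exact ceil(log2(v)) for an integer v >= 1."""
    return (v - 1).bit_length()


def worst_gap(limit: int) -> int:
    """Largest value of [clog(x+z) - clog(x)] - [clog(y+z) - clog(y)]
    over 1 <= y <= x <= limit and 1 <= z <= limit, where clog = ceil_log2."""
    worst = 0
    for x in range(1, limit + 1):
        for y in range(1, x + 1):
            for z in range(1, limit + 1):
                lhs = ceil_log2(x + z) - ceil_log2(x)
                rhs = ceil_log2(y + z) - ceil_log2(y)
                worst = max(worst, lhs - rhs)
    return worst


if __name__ == "__main__":
    print(worst_gap(40))
```

A gap of 1 is attained, for instance, at $x = 4$, $y = 3$, $z = 1$, so the additive slack of 1 in the inequality cannot be dropped; the check never finds a gap larger than 1.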

Discussion

Note that in our complexity analyses, each access to $F$ was priced at the maximum possible of
$O(\log k)$ shared memory accesses, and hence with the possibility of $O(\log k)$ collisions. Thus
the Fibonacci heap is not essential to the result; any data structure allowing max, incr, and decr
in $O(\log k)$ time will suffice. We chose Fibonacci heaps to optimise the performance as much as
possible. It may be noted that non-amortized versions of Fibonacci heaps are now available [1].

We have shown that it takes $O(k \log k \log n)$ visiting collisions following step $t$ to process
the $n$ tasks in the system at time $t$. Similarly, we have shown that it takes
$O(k \log k \log n)$ reporting collisions preceding time $t$ to arrive at a configuration of $n$
tasks at time $t$. We have not shown that over $O(k \log k \log n)$ collisions the system completes
$n$ tasks; this is not true. The second assertion holds only when we assume steady loading
conditions. The local steady state assumption of $l_0(b_1) = c \log n$ ($c$ a constant) appears
reasonable. The constant $c$ measures how steady the loading is.